How data becomes knowledge


The modern age's passion for collecting has a name: Big Data. To unearth this treasure trove of data, science must follow certain rules and use the right infrastructure. Two examples from Forschungszentrum Jülich show how it's done.

People like to collect. For some, collecting and preserving are even part of the job - scientists among them. They not only accumulate knowledge, but also produce and record vast amounts of data and information from which knowledge has yet to emerge.

Modern technology makes collecting data immensely easier and tempts people to hoard it. Experts estimate that 90 per cent of the world's available data was compiled in the last two years alone. In science, ever more precise experimental equipment, measuring systems and computer simulations are generating ever larger mountains of data. At CERN, the European Organisation for Nuclear Research, experiments generate about 50 petabytes of data a year, while the Jülich Supercomputing Centre (JSC) produces about 20 petabytes a year for simulations alone. At such volumes, experts speak of Big Data: data that can no longer be analysed by hand or with conventional methods.

Big Data is about far more than a passion for collecting. It offers the chance to uncover correlations and patterns in these masses of data that would not be apparent in small samples. But "big" alone does not bring new insights. Sorting, filtering and evaluating data, but also sharing and exchanging it - these are the major challenges in turning Big Data into Smart Data, i.e. data from which meaningful information can be extracted. These challenges place new demands on IT infrastructure, on the handling of data and on collaboration in science. Two Jülich examples show what Big Data means in everyday research.

1.4 billion measurement points

Dr Andreas Petzold and his colleagues from the Jülich Institute for Energy and Climate Research (IEK-8) are old hands in the Big Data business. "In the IAGOS project, our measuring instruments have been travelling around the world on commercial airliners for over 20 years. They take a measurement every four seconds during the flight: to record greenhouse gases such as carbon dioxide and methane, other reactive trace gases such as ozone, carbon monoxide and nitrogen oxides, but also fine dust as well as ice and cloud particles. In that time, more than 1.4 billion measurement points have been collected over 320 million flight kilometres." It goes without saying that such quantities of data can no longer be analysed with a pocket calculator. The scientists had to develop customised software packages, for example for calibrating the instruments, transmitting the data and evaluating it.
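A quick back-of-envelope calculation shows how such figures fit together. The cruising speed and the number of parameters recorded per measurement cycle below are assumptions made for this sketch, not IAGOS specifications:

```python
# Back-of-envelope check of the IAGOS figures (illustration only).
# Assumed, not from IAGOS: ~900 km/h mean cruising speed and ~4-5
# recorded parameters per 4-second measurement cycle.

flight_km = 320e6           # flight kilometres quoted in the article
cruise_speed_kmh = 900.0    # assumed mean cruising speed
interval_s = 4.0            # one measurement cycle every four seconds
params_per_cycle = 4.4      # assumed parameters recorded per cycle

flight_hours = flight_km / cruise_speed_kmh   # roughly 3.6e5 hours
cycles = flight_hours * 3600 / interval_s     # roughly 3.2e8 cycles
points = cycles * params_per_cycle            # roughly 1.4e9 points

print(f"{flight_hours:,.0f} flight hours -> {points:.2e} measurement points")
```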

The climate data are used to identify long-term trends, for example in air pollution or greenhouse gases. Such global data and findings are of interest to researchers worldwide. Particularly in the case of the climate, it is important to bring together data from different sources - not least because numerous factors are intertwined here: soils, plants, animals, microorganisms, bodies of water, the atmosphere and everything that humans do.

Making data comparable

Up to now, such data has too often been collected in isolation and fed into separate models. This is set to change: at the European level, several large-scale infrastructure projects started at the beginning of 2019 that aim not only to secure the individual data treasures in a well-structured, long-term fashion, but also to make them comparable.

When researchers from different disciplines, institutions and countries pile their mountains of data into even bigger mountains, one thing is needed first: common standards. So far, these have been too rare. Such rules should cover everything from collection methods in the field to the quality assurance of measurements to the verifiability of the data. Within individual projects, these standards already exist: "In the IAGOS project, for example, we provide each measurement point with a whole series of metadata. These are, so to speak, the keywords for each measurement: what, when, how and where, temperature, flight number and measuring device. This also allows external or later researchers to understand what we measured, and how and where," Petzold emphasises.
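A minimal sketch of what such a metadata-tagged measurement point could look like - the field names and values here are illustrative placeholders, not the actual IAGOS schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class MeasurementPoint:
    """One measurement plus its metadata 'keywords':
    what, when, how and where it was taken."""
    quantity: str         # what was measured, e.g. "ozone"
    value: float          # the measured value
    unit: str             # e.g. "ppb"
    timestamp: datetime   # when
    latitude: float       # where ...
    longitude: float
    altitude_m: float
    temperature_k: float  # ambient temperature
    flight_number: str    # which flight
    instrument_id: str    # which measuring device

point = MeasurementPoint(
    quantity="ozone", value=42.7, unit="ppb",
    timestamp=datetime(2019, 5, 1, 12, 0, 4, tzinfo=timezone.utc),
    latitude=50.9, longitude=6.4, altitude_m=10_500.0,
    temperature_k=220.4, flight_number="XX1234", instrument_id="P1",
)
```

Metadata of this kind is what later makes measurements findable and comparable across projects.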


EU mammoth project worth 19 million euros

Now cross-project standards are needed, and this is precisely what ENVRI-FAIR, the European infrastructure project for the environmental sciences, wants to introduce. ENVRI stands for Environmental Research Infrastructures, because all established European infrastructures of Earth system research are involved in the project - from local measuring stations to mobile instruments like those of IAGOS to satellite-based systems. FAIR describes how researchers are expected to collect and store the vast amounts of data in future: findable, accessible, interoperable and reusable.

Petzold is coordinating this mammoth project, which the EU is funding with 19 million euros over four years. "ENVRI-FAIR will enable us to link different data and relate them to each other - the basis for turning our Big Data into Smart Data that can be used for research, innovation and society," he emphasises. To ensure that as many researchers as possible can access these data treasures, open access is planned via the European Open Science Cloud, which is currently being set up - as it is for all other European infrastructure projects.

To realise such ambitious plans, the domain scientists need the support of IT specialists - for example, for the coming expansion of IT infrastructures, data management and computing centres. At Forschungszentrum Jülich, the Jülich Supercomputing Centre (JSC) is on hand as a partner with extensive expertise: among other things, it offers two supercomputers, suitable computing methods, enormous storage capacities of several hundred petabytes and around 200 experts on a wide range of topics. The JSC supports ENVRI-FAIR, for example, in setting up automated management of the large data streams. One of the main topics is data access, because today it is increasingly a question of ensuring that large data sets - and the conclusions drawn from them - can be examined and verified by all the research groups involved in international projects with many cooperation partners.

Teamwork in Simulation Laboratories

To this end, new computer architectures that can handle and evaluate Big Data particularly well, such as JUWELS and DEEP, are being developed in Jülich. To improve the exchange between high-performance computing specialists and domain scientists, the JSC has also set up so-called Simulation Laboratories, in which the various experts work closely together. They support researchers in the general handling of Big Data and in its evaluation - including with the help of machine learning.

"The experts for machine learning and the specialists for high-performance computers know how to evaluate large amounts of data with the supercomputers. For their part, specialised scientists such as biologists, physicians or materials scientists can ask the meaningful questions of their specific data and have the generated answers evaluated. With such collaboration, adaptive models - such as deep neural networks - can be trained with the available data to predict processes in the atmosphere, in biological systems, in materials or in a fusion reactor," explains Dr Jenia Jitsev, a specialist in deep learning and machine learning at JSC.

One of the Jülich researchers working closely with the JSC is Dr Timo Dickscheid, head of the Big Data Analytics working group at the Jülich Institute of Neuroscience and Medicine (INM-1). His institute, too, accumulates enormous amounts of data, because its subject is the most complex structure a human possesses: the brain. "We are developing a three-dimensional model of it that takes into account both structural and functional organisational principles," says the computer scientist.

He has already collaborated on BigBrain, a 3D model assembled from microscope images of tissue sections of a human brain. For this, the Jülich brain researchers, together with a Canadian research team, prepared and digitised 7,404 wafer-thin sections in over 1,000 hours of work.

Surfing through the brain

"This 3D brain model is around one terabyte in size," reports Dickscheid, "so it's already a challenge to display the image data set smoothly on the screen - not to mention complex image analysis procedures that automatically analyse this data set on the Jülich supercomputers and thus add three-dimensional maps of the different brain areas piece by piece." A manual, complete drawing of these areas by the scientists is no longer feasible with these data sizes. For three years, he and his colleagues have been intensively programming and exchanging information with the JSC.

The result: despite the huge volume of data, the programme makes it possible to surf smoothly through the brain and zoom down to the level of cell clusters. The trick: "We don't provide the user with the entire data set in full resolution, but only the small part they are looking at," explains Dickscheid. "And in real time," he adds. The BigBrain model and the 3D maps are a prime example of shared Big Data: anyone on the internet can now click, rotate, zoom and marvel at them.
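The idea behind this trick resembles the multi-resolution tiling used by online map services: the volume is stored at several zoom levels and divided into chunks, and the viewer only ever fetches the few chunks covering the current viewport. Here is a minimal sketch of such a chunk lookup - the chunk size, level scheme and 20-micrometre base resolution are assumptions for illustration, not the actual BigBrain service:

```python
# Minimal sketch of multi-resolution chunk lookup. Only the chunk
# around the current viewport, at the current zoom level, is fetched.
# Chunk size, level scheme and base resolution are assumptions.

CHUNK = 256  # edge length of one cubic chunk, in voxels

def chunk_key(x_um, y_um, z_um, level, base_voxel_um=20.0):
    """Map a physical position (micrometres) and a zoom level to the
    index of the chunk containing it. Level 0 is full resolution;
    each level up halves the resolution in every dimension."""
    scale = base_voxel_um * 2 ** level   # micrometres per voxel at this level
    voxel = (int(x_um / scale), int(y_um / scale), int(z_um / scale))
    return level, tuple(v // CHUNK for v in voxel)

# Zoomed far out, the viewer requests one coarse chunk ...
print(chunk_key(10_000, 20_000, 5_000, level=6))   # (6, (0, 0, 0))
# ... and after zooming in, a tiny full-resolution chunk.
print(chunk_key(10_000, 20_000, 5_000, level=0))   # (0, (1, 3, 0))
```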

Scientists from all over the world take advantage of this, because the three-dimensional representation enables them to assess spatial relationships in the complicated architecture of the human brain far better than before - and to gain new insights. Dutch scientists, for example, want to use the atlas to better understand the human visual cortex at the cellular level and to use this knowledge to refine neuroimplants for the blind.

JUWELS and DEEP

JUWELS is a highly flexible, modular supercomputer whose adaptable design was developed in Jülich and is aimed at an extended range of tasks - from Big Data applications to computationally intensive simulations. The name stands for "Jülich Wizard for European Leadership Science". Within the European DEEP projects, Jülich researchers are also developing new modular supercomputer architectures that can be used even more flexibly and efficiently for scientific applications than previous systems.

"We need to agree in the research community that in future the authors of the data will be named on an equal footing with the authors of a scientific publication."

Katrin Amunts, Director Institute for Neuroscience and Medicine

"Making results like our various brain maps available to everyone is a cornerstone of science," says Professor Katrin Amunts, Director at the Institute of Neuroscience and Medicine and Dickscheid's boss. However, making the underlying data publicly available forces a paradigm shift in research: "At the moment, publications of scientific studies still play a much bigger role than publications of data. We need to agree in the research community that the authors of the data should be named and cited on an equal footing with the authors of a scientific publication. Here, too, FAIR data is a very central point; data should be findable, accessible, interoperable and reusable; an approach that the Human Brain Project is actively promoting," Amunts emphasises. After all, publications are the currency with which research is traded and careers are made.

Sharing as an opportunity

Astrophysicists are considered a shining example. "Historically, it became the norm there to share data openly," says Dr Ari Asmi from the University of Helsinki, a colleague of Andreas Petzold and co-coordinator of ENVRI-FAIR. The most recent example is the sensational first image of a black hole. "That was only possible because, firstly, the global scientific community in radio astronomy is extremely closely networked and, secondly, a radio telescope can only be used if you subsequently disclose the data you obtain with it."

Asmi sees the sharing of Big Data as a great opportunity for research: "It will become really exciting if we manage to use the new methods to combine data from different disciplines, for example our environmental calculations with data from the social, political and economic sciences. If we succeed, we will have viable models to understand climate change in its entirety, for example, and to develop concepts for future action." And then collecting will be not just Big, but truly Smart.


Author: Brigitte Stahl-Busse

This article first appeared in effzett - Das Magazin aus dem Forschungszentrum Jülich.
